In order to generate synthetic basket data sets for better benchmark testing,
it is important to integrate characteristics from real-life databases into
the synthetic basket data sets. The characteristics that could be used for this 
purpose include the frequent itemsets and association rules. The problem of 
generating synthetic basket data sets from frequent itemsets is generally re- 
ferred to as inverse frequent itemset mining. In this paper, we show that the 
problem of approximate inverse frequent itemset mining is NP-complete. 
Then we propose and analyze an approximate algorithm for approximate in- 
verse frequent itemset mining, and discuss privacy issues related to the syn- 
thetic basket data set. In particular, we propose an approximate algorithm to 
determine the privacy leakage in a synthetic basket data set. 



* An extended abstract of this paper has appeared in [35].



Keywords: data mining, privacy, complexity, inverse frequent itemset 
1 Introduction

Since the seminal paper [2], association rule and frequent itemset mining received 
a lot of attention. By comparing five well-known association rule algorithms (i.e., 
Apriori Q, Charm flU, FP-growth O, Closet (221, and MagnumOpus 1531 ) 
using three real-world data sets and the artificial data set from IBM Almaden, 
Zheng et al. [39] found that the performance of these algorithms on the artificial
data set is very different from their performance on real-world data sets. Thus there
is a great need to use real-world data sets as benchmarks.

However, organizations usually hesitate to provide their real-world data sets 
as benchmarks due to the potential disclosure of private information. There have 
been two different approaches to this problem. The first is to disturb the data be- 
fore delivery for mining so that real values are obscured while preserving statistics 
on the collection. Some recent work EIS0[I3HIH1|291EDI investigates the 
tradeoff between private information leakage and accuracy of mining results. One
problem with the perturbation-based approach is that it cannot always fully
preserve individuals' privacy while achieving precise mining results.

The second approach to address this problem is to generate synthetic basket
data sets for benchmarking purposes by integrating characteristics from real-world
basket data sets that may have influence on the software performance. The fre- 
quent sets and their supports (defined as the number of transactions in the basket 
data set that contain the items) can be considered to be a reasonable summary of 
the real- world data set. As observed by Calders Q, association rules for basket 
data set can be described by frequent itemsets. Thus it is sufficient to consider
frequent itemsets only. Ramesh et al. [27] recently investigated the relation between
the distribution of discovered frequent sets and the performance of association rule
mining. Their results suggest that the performance of an association rule mining
method on the original data set should be very similar to its performance on a
synthetic data set compatible with the same frequent set mining results.

Informally speaking, in this approach, one first mines frequent itemsets and
their corresponding supports from the real-world basket data sets. These frequent
itemset support constraints are used to generate the synthetic (mock) data
set, which can then be used for benchmarking. For this approach, private information
should be deleted from the frequent itemset support constraints or from the mock
database. The authors of [7, 20] investigate the problem of whether there exists a
data set that is consistent with the given frequent itemsets and frequencies, and
show that this problem is NP-complete. The frequency of each frequent itemset
can be taken as a constraint over the original data set. The problem of inverse
frequent set mining can then be translated into a linear constraint problem. Linear
programming problems with hundreds or thousands of variables and constraints
are commonly solved today. However, the number of variables and constraints in
this scenario is far beyond hundreds or thousands (e.g., $2^t$, where $t$ is the number
of items). Hence it is impractical to apply linear programming techniques directly.
Recently, the authors of [36] investigated a heuristic method to generate a synthetic
basket data set using the frequent sets and their supports mined from the original
basket data set. Instead of applying linear programming directly on all the
items, it applies graph-theoretical results to decompose items into independent
components and then applies linear programming on each component. One potential
problem here is that the number of items contained in some components may
still be too large (especially when items are highly correlated with each other), which
makes the application of linear programming infeasible.

The authors of [27, 28] proposed a method to generate basket data sets for
benchmarking when the length distributions of frequent and maximal frequent 
itemset collections are available. Though the generated synthetic data set pre- 
serves the length distributions of frequent patterns, one serious limitation is that 
the size of transaction databases generated is much larger than that of original 
database while the number of items generated is much smaller. We believe the 
sizes of items and transactions are two important parameters as they may signifi- 
cantly affect the performance of association rule mining algorithms. 
Instead of using the exact inverse frequent itemset mining approach, we propose
an approach to construct transaction databases which have the same size as
the original transaction database and which are approximately consistent with the 
given frequent itemset constraints. These approximate transaction databases are 
sufficient for benchmarking purpose. In this paper, we consider the complexity 
problem, the approximation problem, and privacy issues for this approach. 

We first introduce some terminology. $\mathcal{I}$ is the finite set of items. A
transaction over $\mathcal{I}$ is defined as a pair $(tid, I)$ where $I$ is a subset of
$\mathcal{I}$ and $tid$ is a natural number, called the transaction identifier. A
transaction database $\mathcal{D}$ over $\mathcal{I}$ is a finite set of transactions over
$\mathcal{I}$. For an itemset $I \subseteq \mathcal{I}$ and a transaction $(tid, J)$, we say
that $(tid, J)$ contains $I$ if $I \subseteq J$. The support of an itemset $I$ in a
transaction database $\mathcal{D}$ over $\mathcal{I}$ is defined as the number of transactions
$T$ in $\mathcal{D}$ that contain $I$, and is denoted $support(I, \mathcal{D})$. The frequency of
an itemset $I$ in a transaction database $\mathcal{D}$ over $\mathcal{I}$ is defined as

$$freq(I, \mathcal{D}) = \frac{support(I, \mathcal{D})}{|\mathcal{D}|}.$$

Calders [6, 7] defined the following problems that are related to inverse
frequent itemset mining.

FREQSAT

Instance: An item set $\mathcal{I}$ and a sequence $(I_1, f_1), (I_2, f_2), \ldots, (I_m, f_m)$,
where $I_i \subseteq \mathcal{I}$ are itemsets and $0 \le f_i \le 1$ are nonnegative rational
numbers, for all $1 \le i \le m$.

Question: Does there exist a transaction database $\mathcal{D}$ over $\mathcal{I}$ such that
$freq(I_i, \mathcal{D}) = f_i$ for all $1 \le i \le m$?

FFREQSAT (Fixed size FREQSAT)

Instance: An integer $n$, an item set $\mathcal{I}$, and a sequence $(I_1, f_1), (I_2, f_2), \ldots,
(I_m, f_m)$, where $I_i \subseteq \mathcal{I}$ are itemsets and $0 \le f_i \le 1$ are nonnegative
rational numbers, for all $1 \le i \le m$.

Question: Does there exist a transaction database $\mathcal{D}$ over $\mathcal{I}$ such that
$\mathcal{D}$ contains $n$ transactions and $freq(I_i, \mathcal{D}) = f_i$ for all $1 \le i \le m$?
FSUPPSAT

Instance: An integer $n$, an item set $\mathcal{I}$, and a sequence $(I_1, s_1), (I_2, s_2), \ldots,
(I_m, s_m)$, where $I_i \subseteq \mathcal{I}$ are itemsets and $s_i \ge 0$ are nonnegative
integers, for all $1 \le i \le m$.

Question: Does there exist a transaction database $\mathcal{D}$ over $\mathcal{I}$ such that
$\mathcal{D}$ contains $n$ transactions and $support(I_i, \mathcal{D}) = s_i$ for all $1 \le i \le m$?
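The definitions of support, frequency, and the FSUPPSAT question can be made concrete with a small sketch. The function names and the toy database below are illustrative assumptions, not part of the paper; the code only checks a given database against given constraints and does not search for one.

```python
from fractions import Fraction

def support(itemset, database):
    """Number of transactions (tid, items) whose item set contains `itemset`."""
    target = frozenset(itemset)
    return sum(1 for _tid, items in database if target <= frozenset(items))

def frequency(itemset, database):
    """support(I, D) / |D|, kept exact as a rational number."""
    return Fraction(support(itemset, database), len(database))

def satisfies_fsuppsat(database, n, constraints):
    """True if the database has exactly n transactions and meets every
    support constraint (itemset, s) exactly."""
    return len(database) == n and all(
        support(itemset, database) == s for itemset, s in constraints
    )

# Hypothetical toy database over the items {a, b, c}.
toy_db = [(1, {"a", "b"}), (2, {"a"}), (3, {"b", "c"}), (4, {"a", "b", "c"})]
toy_constraints = [({"a"}, 3), ({"a", "b"}, 2)]
```

Verifying a candidate database is easy; the difficulty discussed in the text lies in deciding whether any such database exists.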

Obviously, the problem FSUPPSAT is equivalent to the problem FFREQSAT.
Calders [6] showed that FREQSAT is NP-complete and that the problem FSUPPSAT
is equivalent to the Intersection Pattern problem IP: given an $n \times n$ matrix $C$ with
integer entries, do there exist sets $S_1, \ldots, S_n$ such that $|S_i \cap S_j| = C[i,j]$? Though
it is known that IP is NP-hard, it is an open problem whether IP belongs to NP.

In this paper, we will consider the problem of generating transaction databases
that approximately satisfy the given frequent itemset support constraints. Section
2 discusses the computational complexity of approximating transaction databases.
Section 3 proposes an algorithm to generate an approximate transaction
database. Sections 4 and 5 discuss privacy issues. Finally, Section
6 draws conclusions.

2 Approximations 

Though it is an interesting problem to study whether there exists a size n transac- 
tion database that satisfies a set of given frequency constraints, it is sufficient for 
benchmarking purpose to construct a transaction database that is approximately 
at the size of n and that approximately satisfies the set of given frequency con- 
straints. Thus we define the following problem. 

ApproSUPPSAT

Instance: An integer $n$, an item set $\mathcal{I}$, and a sequence $(I_1, s_1), (I_2, s_2), \ldots,
(I_m, s_m)$, where $I_i \subseteq \mathcal{I}$ are itemsets and $s_i \ge 0$ are nonnegative
integers, for all $1 \le i \le m$.

Question: Does there exist a transaction database $\mathcal{D}$ of $n'$ transactions over
$\mathcal{I}$ such that $|n - n'| = O(m)$ and $|support(I_i, \mathcal{D}) - s_i| = O(m)$ for all
$1 \le i \le m$?

Note that in the above definition, the approximation errors are based on the
parameter $m$ instead of $n$ since, for most applications, $m$ is small and $n$ is much
larger. Indeed, $n$ could be exponential in $m$. For performance testing
purposes, it is not meaningful to use $n$ as the parameter in these situations.
It is also straightforward to show that the problem ApproSUPPSAT is equivalent
to the following problem: given an integer $n$, an item set $\mathcal{I}$, and a sequence
$(I_1, s_1), (I_2, s_2), \ldots, (I_m, s_m)$, decide whether there exists a transaction database
$\mathcal{D}$ over $\mathcal{I}$ with $n$ transactions and $0 \le support(I_i, \mathcal{D}) - s_i = O(m)$ for all
$1 \le i \le m$.

In the following we show that ApproSUPPSAT is NP-complete. Note that 
for the non-approximate version FSUPPSAT of this problem, we do not know 
whether it is in NP. 

Lemma 2.1 ApproSUPPSAT $\in$ NP.

Proof. Since the size of the transaction database is $n$, which might be exponential
in the size of the instance description, it is not possible to guess a transaction
database in polynomial time and check whether it satisfies the constraints. In the
following, we use other techniques to show that the problem is in NP. Let $\mathcal{I}$ be
the collection of items and $(I_1, s_1), (I_2, s_2), \ldots, (I_m, s_m)$ be the sequence of
support constraints. Assume that $|\mathcal{I}| = t$. Let $J_0, J_1, \ldots, J_{2^t-1}$ be an enumeration
of the $2^t$ subsets of $\mathcal{I}$ (in particular, let $J_0 = \emptyset$ and $J_{2^t-1} = \mathcal{I}$), and
$X_0, X_1, \ldots, X_{2^t-1}$ be $2^t$ variables corresponding to these itemsets.

Assume that a transaction database $\mathcal{D}$ with $n' = n + O(m)$ transactions contains
$X_i$ copies of the itemset $J_i$ for each $0 \le i < 2^t$ and that $\mathcal{D}$ approximately satisfies
the support constraints $(I_1, s_1), (I_2, s_2), \ldots, (I_m, s_m)$. Then there exists an integer
$k$ such that the equations (1) hold for some integer values $X_0, \ldots, X_{2^t-1}, Z_0, \ldots,
Z_m$. Similarly, if there is an integer $k$ and an integer solution to the equations
(1), then there is a transaction database $\mathcal{D}$ with $n' = n + O(m)$ transactions that
approximately satisfies the support constraints $(I_1, s_1), \ldots, (I_m, s_m)$,
where $k$ is a large enough integer. In other words, if the given instance of the
ApproSUPPSAT problem is satisfiable, then the equations (1) have an integer
solution. That is, the solution space for the equations (1) is a non-empty convex
polyhedron. A simple argument can then be used to show that there is an extreme
point $(X_0^0, \ldots, X_{2^t-1}^0)$ (not necessarily an integer point) on this convex polyhedron
that satisfies the following property:

• There are at most $m+1$ non-zero values among the variables $X_0^0, \ldots, X_{2^t-1}^0$,
$Z_0, \ldots, Z_m$.

Let $Y_i = [X_i^0]$ be the closest integer to $X_i^0$ for $0 \le i < 2^t$ and $\mathcal{D}_Y$ be the
transaction database that contains $Y_i$ copies of the itemset $J_i$ for each $0 \le i < 2^t$. Then
$\mathcal{D}_Y$ contains $n + O(m)$ transactions and $|support(I_i, \mathcal{D}_Y) - s_i| = O(m)$ for all
$1 \le i \le m$.

In other words, the given instance of the ApproSUPPSAT problem is satisfiable
if and only if there exist itemsets $J_1, \ldots, J_{m+1}$ and an integer sequence
$x_1, \ldots, x_{m+1}$ such that the transaction database $\mathcal{D}$ consisting of $x_i$ copies of
itemset $J_i$ for each $i \le m+1$ witnesses the satisfiability. Thus ApproSUPPSAT $\in$ NP,
which completes the proof of the lemma. Q.E.D.

Lemma 2.2 ApproSUPPSAT is NP-hard. 

¹ A similar argument has been used to prove the fundamental theorem of linear optimization in
linear programming. See, e.g., [12, 24].
Proof. The proof is based on an amplification of the reduction in the NP-hardness
proof for FREQSAT in []6l which is alike the one given for 2SAT in |[P3~1 . In the 
following, we reduce the NP-complete problem 3-colorability to ApproSUPPSAT.
Given a graph $G = (V, E)$, $G$ is 3-colorable if there exists a 3-coloring function
$c : V \to \{R, G, B\}$ such that for each edge $(u, v)$ in $E$ we have $c(u) \ne c(v)$.

For the graph $G = (V, E)$, we construct an instance $A(G)$ of ApproSUPPSAT
as follows. Let $m = 6|V| + 3|E|$, and $n = k_0 m^2$ for some large $k_0$ (note that
we need $k_0 > k$ for the constant $k$ we will discuss later). Let the item set be
$\mathcal{I} = \{R_v, G_v, B_v : v \in V\}$; the $m$ support constraints are defined as follows. For
each vertex $v \in V$:

$$support(\{R_v\}) = \lfloor n/3 \rfloor, \quad support(\{G_v\}) = \lfloor n/3 \rfloor, \quad support(\{B_v\}) = \lfloor n/3 \rfloor,$$

$$support(\{R_v, G_v\}) = 0, \quad support(\{R_v, B_v\}) = 0, \quad support(\{G_v, B_v\}) = 0.$$

For each edge $(u, v) \in E$:

$$support(\{R_u, R_v\}) = 0, \quad support(\{G_u, G_v\}) = 0, \quad support(\{B_u, B_v\}) = 0.$$

In the following, we show that there is a transaction database $\mathcal{D}$ satisfying this
ApproSUPPSAT problem if and only if $G$ is 3-colorable.

Suppose that $c$ is a 3-coloring of $G$. Let $T_1$ be a transaction defined by letting
$T_1 = \{C_v : v \in V\}$ where

$$C_v =_{def} \begin{cases} R_v & \text{if } c(v) = R; \\ G_v & \text{if } c(v) = G; \\ B_v & \text{if } c(v) = B. \end{cases}$$

Let transactions $T_2$ and $T_3$ be defined by the colorings $c'$ and $c''$ resulting from
cyclically rearranging the colors $R, G, B$ in the coloring $c$. Let the transaction database
$\mathcal{D}$ consist of $\lfloor n/3 \rfloor$ copies of each of the transactions $T_1$, $T_2$, and $T_3$ (we may need to
add one or two additional copies of $T_1$ if $3\lfloor n/3 \rfloor \ne n$). Then $\mathcal{D}$ satisfies the
ApproSUPPSAT problem $A(G)$.
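The forward direction of the reduction can be sketched in a few lines. The helper names, the encoding of items as (color, vertex) pairs, and the triangle example are assumptions made for illustration; the sketch builds the support constraints of $A(G)$ with target value $\lfloor n/3 \rfloor$ and the three-transaction database obtained from a given proper coloring.

```python
def build_instance(vertices, edges, n):
    """Support constraints of A(G): six per vertex and three per edge."""
    third = n // 3
    constraints = []
    for v in vertices:
        for color in "RGB":
            constraints.append((frozenset({(color, v)}), third))
        for a, b in (("R", "G"), ("R", "B"), ("G", "B")):
            constraints.append((frozenset({(a, v), (b, v)}), 0))
    for u, v in edges:
        for color in "RGB":
            constraints.append((frozenset({(color, u), (color, v)}), 0))
    return constraints

def database_from_coloring(coloring, n):
    """floor(n/3) copies of T1, T2, T3, padded with extra copies of T1."""
    shift = {"R": "G", "G": "B", "B": "R"}  # cyclic rearrangement of colors
    c1 = dict(coloring)
    c2 = {v: shift[col] for v, col in c1.items()}
    c3 = {v: shift[col] for v, col in c2.items()}
    transactions = [frozenset((col, v) for v, col in c.items()) for c in (c1, c2, c3)]
    database = []
    for t in transactions:
        database.extend([t] * (n // 3))
    database.extend([transactions[0]] * (n - 3 * (n // 3)))
    return database

def support(itemset, database):
    return sum(1 for t in database if itemset <= t)

# Hypothetical example: a triangle with a proper 3-coloring.
tri_constraints = build_instance([0, 1, 2], [(0, 1), (1, 2), (0, 2)], n=10)
tri_db = database_from_coloring({0: "R", 1: "G", 2: "B"}, n=10)
max_error = max(abs(support(i, tri_db) - s) for i, s in tri_constraints)
```

Because the padding adds at most two copies of $T_1$, every support differs from its target by at most two in this construction.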

Suppose $\mathcal{D}$ is a transaction database satisfying the ApproSUPPSAT problem
$A(G)$. We will show that there is a transaction $T$ in $\mathcal{D}$ from which a 3-coloring of
$G$ can be constructed. Let $\mathcal{I}_1$ be the collection of itemsets defined as

$$\mathcal{I}_1 = \{\{R_v, G_v\}, \{R_v, B_v\}, \{G_v, B_v\} : v \in V\} \cup
\{\{R_u, R_v\}, \{G_u, G_v\}, \{B_u, B_v\} : (u, v) \in E\}.$$

That is, $\mathcal{I}_1$ is the collection of itemsets that should have support 0 according to the
support constraints. Since $\mathcal{D}$ satisfies $A(G)$, for each $I' \in \mathcal{I}_1$, $support(I', \mathcal{D}) = 0$
is approximately satisfied. Thus there is a constant $k_1 > 0$ such that at most $k_1 m \times
|\mathcal{I}_1| = 3k_1 m(|V| + |E|)$ transactions in $\mathcal{D}$ contain an itemset in $\mathcal{I}_1$. Let $\mathcal{D}_1$ be
the transaction database obtained from $\mathcal{D}$ by deleting all transactions that contain
itemsets from $\mathcal{I}_1$. Then $\mathcal{D}_1$ contains at least $n - 3k_1 m(|V| + |E|)$ transactions.

For each vertex $v \in V$, we say that a transaction $(tid, J)$ in $\mathcal{D}$ does not contain
$v$ if $J$ does not contain any items from $\{R_v, G_v, B_v\}$. Since $\mathcal{D}$ satisfies $A(G)$,
for each $v \in V$, approximately one third of the transactions contain $R_v$ ($G_v$, $B_v$,
respectively). Thus there is a constant $k_2 > 0$ such that at most $3k_2 m \times |V|$
transactions in $\mathcal{D}$ do not contain some vertex $v \in V$. In other words, there are at
least $n - 3k_2 m \times |V|$ transactions $J$ in $\mathcal{D}$ such that $J$ contains $v$ for all $v \in V$.

Let $\mathcal{D}_2$ be the transaction database obtained from $\mathcal{D}_1$ by deleting all
transactions $J$ such that $J$ does not contain some vertex $v \in V$. The above analysis
shows that $\mathcal{D}_2$ contains at least $n - 3k_1 m(|V| + |E|) - 3k_2 m|V|$ transactions. Let
$k = \max\{k_1, k_2\}$. Then we have

$$\begin{aligned}
|\mathcal{D}_2| &\ge n - 3km(|V| + |E|) - 3km|V| \\
&= n - km(6|V| + 3|E|) \\
&= n - km^2 \\
&= k_0 m^2 - km^2.
\end{aligned}$$

By the assumption on $k_0$ at the beginning of this proof, we have $|\mathcal{D}_2| \ge 1$. For any
transaction $J$ in $\mathcal{D}_2$, we can define a coloring $c$ for $G$ by letting

$$c(v) = \begin{cases} R & \text{if } J \text{ contains } R_v; \\ G & \text{if } J \text{ contains } G_v; \\ B & \text{if } J \text{ contains } B_v. \end{cases}$$

By the definition of $\mathcal{D}_2$, the coloring $c$ is defined unambiguously. That is, $G$ is
3-colorable.



This completes the proof for NP-hardness of ApproSUPPSAT. Q.E.D.

Theorem 2.3 ApproSUPPSAT is NP-complete.

Proof. This follows from Lemma 2.1 and Lemma 2.2. Q.E.D.

We showed that the problem ApproSUPPSAT is NP-hard. In the proof of
Lemma 2.2 we use the fact that the number $n$ of transactions of the target basket
database is larger than the product of the number $m$ of support constraints
and the approximation error $O(m)$ (that is, $n$ is in the order of $O(m^2)$). In practice,
the number $n$ may not be larger than $km^2$. Then one may wonder whether the
problem is still NP-complete. If $n$ is very small, for example, in the order of $O(m)$,
then obviously, the problem ApproSUPPSAT becomes trivial since one can just
construct the transaction database as the collection of $n$ copies of the itemset $\mathcal{I}$
(that is, the entire set of items). This is not a very interesting case since if $n$ is in
the order of $m$, one certainly does not want the approximation error to be in the order
of $n$ also. A reasonable alternative is to define a constant $\gamma$ to
replace the approximation error $O(m)$. Then the proof of Lemma 2.2 shows that the
problem ApproSUPPSAT with approximation error $\gamma$ (instead of $O(m)$) is still
NP-complete if $n > \gamma m$. Tighter bounds could be achieved if weighted approximation
errors for different support constraints are given.

3 Generating approximate transaction databases

In this section, we design and analyze a linear program based algorithm to
approximate the NP-complete problem ApproSUPPSAT. Let $\mathcal{I} = \{e_1, \ldots, e_t\}$ be
the collection of items, $n$ be the number of transactions in the desired database
$\mathcal{D}$, and $(I_1, s_1), (I_2, s_2), \ldots, (I_m, s_m)$ be the sequence of support constraints.
According to the proof of Lemma 2.1, if this instance of ApproSUPPSAT is solvable,
then there is a transaction database $\mathcal{D}$, consisting of at most $m+1$ itemsets
$J_1, \ldots, J_{m+1}$, that satisfies these constraints. Let $X_1, \ldots, X_{m+1}$ be variables
representing the numbers of duplicated copies of these itemsets in $\mathcal{D}$ respectively.
That is, $\mathcal{D}$ contains $X_j$ copies of $J_j$ for each $j$. For all $i \le m$ and $j \le m + 1$, let
$x_{i,j}$ and $y_{i,j}$ be variables with the property that $x_{i,j} = X_j \times y_{i,j}$ and

$$y_{i,j} = \begin{cases} 1 & \text{if } I_i \subseteq J_j, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

Then we have $support(I_i, \mathcal{D}) = x_{i,1} + \cdots + x_{i,m+1}$ and the above given
ApproSUPPSAT instance can be formulated as the following question.



$$\text{minimize } z_1 + z_2 + \cdots + z_m \tag{3}$$

subject to

$$\begin{aligned}
& X_1 + X_2 + \cdots + X_{m+1} = n, \\
& s_i + z_i = x_{i,1} + \cdots + x_{i,m+1}, \\
& x_{i,j} = X_j \times y_{i,j}, \\
& y_{i,j} = 1 \text{ if } I_i \subseteq J_j \text{ and } y_{i,j} = 0 \text{ otherwise}, \\
& z_i, X_j \text{ are nonnegative integers},
\end{aligned} \tag{4}$$

for $i \le m$ and $j \le m + 1$.

The condition set (4) contains the nonlinear equation $x_{i,j} = X_j \times y_{i,j}$ and
the nonlinear condition specified in (2). Thus in order to approximate the given
ApproSUPPSAT instance using linear programming techniques, we need to convert
these conditions to linear conditions.

We first use characteristic arrays of variables to denote the unknown itemsets
$J_1, \ldots, J_{m+1}$. For any itemset $I \subseteq \mathcal{I}$, let the $t$-ary array $\chi(I) \in \{0, 1\}^t$ be the
characteristic array of $I$. That is, the $i$-th component $\chi(I)[i] = 1$ if and only
if $e_i \in I$. Let $\chi(J_1) = (u_{1,1}, \ldots, u_{1,t}), \ldots, \chi(J_{m+1}) = (u_{m+1,1}, \ldots, u_{m+1,t})$
be a collection of $(m + 1)t$ variables taking values from $\{0, 1\}$, representing the
characteristic arrays of $J_1, \ldots, J_{m+1}$ respectively.

In order to convert the condition specified in (2) to linear conditions, we first
use inner product constraints to represent the condition $I_i \subseteq J_j$. For two
characteristic arrays $\chi_1$ and $\chi_2$, their inner product is defined as
$\chi_1 \cdot \chi_2 = \chi_1[1] \cdot \chi_2[1] + \cdots + \chi_1[t] \cdot \chi_2[t]$. It is straightforward to show that
for two itemsets $I, J \subseteq \mathcal{I}$, we have $\chi(I) \cdot \chi(J) \le \min\{|I|, |J|\}$ and
$\chi(I) \cdot \chi(J) = |I|$ if and only if $I \subseteq J$.

Now the following conditions in (5) will guarantee that the condition in (2) is
satisfied.

$$\begin{aligned}
& |I_i| \cdot y_{i,j} \le \chi(J_j) \cdot \chi(I_i) \le y_{i,j} + |I_i| - 1, \\
& y_{i,j}, u_{j,k} \in \{0, 1\}
\end{aligned} \tag{5}$$

for all $i \le m$, $j \le m + 1$, and $k \le t$. The geometric interpretation of this
condition is as follows. If we consider $(\chi(J_j) \cdot \chi(I_i), y_{i,j})$ as a point in the
2-dimensional space $(x, y)$ shown in Figure 1, then $|I_i| y \le x$ defines the points below
the line passing through the points $(0, 0)$ and $(|I_i|, 1)$, and $x \le y + |I_i| - 1$ defines the
points above the line passing through the points $(|I_i| - 1, 0)$ and $(|I_i|, 1)$. Thus
$y_{i,j} = 1$ if and only if $\chi(J_j) \cdot \chi(I_i) = |I_i|$. That is, $y_{i,j} = 1$ if and only if $I_i \subseteq J_j$.
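The claim that the two linear inequalities pin $y_{i,j}$ to the subset indicator can be checked by brute force on a small item set. This is only a sanity-check sketch; the function names and the four-item universe are assumptions chosen for illustration.

```python
from itertools import combinations

def characteristic(itemset, items):
    """0/1 characteristic array of `itemset` over the ordered item list."""
    return [1 if e in itemset else 0 for e in items]

def feasible_y(I, J, items):
    """All y in {0, 1} with |I|*y <= chi(J).chi(I) <= y + |I| - 1."""
    inner = sum(a * b for a, b in zip(characteristic(I, items), characteristic(J, items)))
    return [y for y in (0, 1) if len(I) * y <= inner <= y + len(I) - 1]

def all_subsets(items):
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def constraints_encode_subset(items):
    """True if, for every pair (I, J), the only feasible y is [I is a subset of J]."""
    for I in all_subsets(items):
        for J in all_subsets(items):
            expected = [1] if I <= J else [0]
            if feasible_y(I, J, items) != expected:
                return False
    return True

subset_check = constraints_encode_subset(["a", "b", "c", "d"])
```

The check enumerates all pairs of subsets, so it is only practical for very small universes; it illustrates the encoding rather than proving it.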



Figure 1: Triangle with vertices $(0, 0)$, $(|I_i| - 1, 0)$, and $(|I_i|, 1)$ in the $(x, y)$ plane.



The nonlinear equations $x_{i,j} = X_j \times y_{i,j}$ can be converted to the following
conditions consisting of inequalities.

$$\begin{aligned}
& x_{i,j} - n y_{i,j} \le 0, \\
& X_j - x_{i,j} \ge 0, \\
& n y_{i,j} + X_j - x_{i,j} \le n,
\end{aligned} \tag{6}$$

for all $i \le m$ and $j \le m + 1$. The constant $n$ is used in the inequalities due to the
fact that $X_j \le n$ for all $j \le m + 1$. The geometric interpretation of the above
inequalities is as follows. If we consider $(x_{i,j}, y_{i,j}, X_j)$ as a point
in a 3-dimensional space $(x, y, X)$ shown in Figure 2, then

1. $x - ny = 0$ defines the plane passing through the points $(0, 0, 0)$, $(0, 0, n)$, and
$(n, 1, n)$. Thus $x_{i,j} - n y_{i,j} \le 0$ guarantees that $x_{i,j} = 0$ if $y_{i,j} = 0$.

2. $X \ge x$ defines the points above the plane passing through the points $(0, 0, 0)$,
$(0, 1, 0)$, and $(n, 1, n)$. This condition together with the condition $y_{i,j} \in
\{0, 1\}$ guarantees that $x_{i,j} \le X_j$ when $y_{i,j} = 1$.

3. $ny + X - x \le n$ defines the points below the plane passing through the points
$(0, 1, 0)$, $(0, 0, n)$, and $(n, 1, n)$. This condition together with the condition
$y_{i,j} \in \{0, 1\}$ guarantees that $x_{i,j} \ge X_j$ when $y_{i,j} = 1$. Together with
condition 2, we have $x_{i,j} = X_j$ when $y_{i,j} = 1$.
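The three inequalities can be checked exhaustively for a small $n$. The sketch below assumes, in addition to the inequalities named in the text, that $x_{i,j}$ is a nonnegative integer not exceeding $n$; the function names are illustrative.

```python
def feasible_x(X, y, n):
    """All integers x in [0, n] satisfying the three linear inequalities:
    x - n*y <= 0,  X - x >= 0,  n*y + X - x <= n."""
    return [
        x for x in range(n + 1)
        if x - n * y <= 0 and X - x >= 0 and n * y + X - x <= n
    ]

def inequalities_encode_product(n):
    """True if x = X * y is the unique feasible value for every
    0 <= X <= n and y in {0, 1}."""
    return all(
        feasible_x(X, y, n) == [X * y]
        for X in range(n + 1)
        for y in (0, 1)
    )

product_check = inequalities_encode_product(12)
```

With $y = 0$ the first inequality forces $x = 0$, and with $y = 1$ the second and third inequalities squeeze $x$ to exactly $X$, which matches the three geometric observations above.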

Note: For convenience, we introduced the intermediate variables
$y_{i,j}$. In order to improve the linear program performance, we may combine the
conditions (5) and (6) to eliminate the variables $y_{i,j}$.

Figure 2: Tetrahedron

Thus the integer programming formulation for the given ApproSUPPSAT
instance is as follows.

$$\text{minimize } z_1 + z_2 + \cdots + z_m \tag{7}$$

subject to conditions (5), (6), and

$$\begin{aligned}
& X_1 + X_2 + \cdots + X_{m+1} = n, \\
& s_i + z_i = x_{i,1} + \cdots + x_{i,m+1}, \\
& z_i, X_j \text{ are nonnegative integers},
\end{aligned} \tag{8}$$

for $i \le m$ and $j \le m + 1$. We first solve the linear relaxation of this integer
program. That is, replace the second equation in the condition (5) by

$$0 \le y_{i,j}, u_{j,k} \le 1 \text{ for all } i \le m,\ j \le m + 1, \text{ and } k \le t$$

and replace the third equation in the condition (8) by

$$z_i, X_j \ge 0.$$

Let $o^* = \{(u^*_{j,k}, y^*_{i,j}, x^*_{i,j}, z^*_i, X^*_j) : i \le m, j \le m + 1, k \le t\}$ denote an optimal
solution to this relaxed linear program. There are several ways to construct an
integer solution $o$ from $o^*$. Let $OPT(z; I)$ denote the optimal value of $z_1 + \cdots +
z_m$ for a given ApproSUPPSAT instance $I$ and $\overline{OPT}(z; I)$ be the corresponding
value for the computed integer solution. For an approximation algorithm, one may
prefer to compute a number $\alpha$ such that

$$\overline{OPT}(z; I) \le \alpha \, OPT(z; I).$$
Theorem 2.3 shows that it is NP-hard to approximate ApproSUPPSAT within an
additive polynomial factor. Thus $\overline{OPT}(z; I)$ is not in the order of $O(m)$ in the
worst case for any polynomial time approximation algorithm, and it is not very
interesting to analyze the worst case for our algorithm.

In the following, we first discuss two simple naive rounding methods to get
an integer solution $o$ from $o^*$. We then present two improved randomized and
derandomized rounding methods.

Method 1: rounding $u^*_{j,k}$

Construct an integer solution $o = (u_{j,k}, y_{i,j}, x_{i,j}, z_i, X_j)$ by rounding $u^*_{j,k}$ to their
closest integers, rounding $X^*_j$ to their almost closest integers so that $X_1 + \cdots +
X_{m+1} = n$, and computing $y_{i,j}$, $x_{i,j}$, and $z_i$ according to their definitions. That is,
for each $j \le m + 1$ and $k \le t$ set

$$u_{j,k} = \begin{cases} 1 & \text{if } u^*_{j,k} \ge 0.5, \\ 0 & \text{otherwise.} \end{cases}$$

For the rounding of $X^*_j$, first round $X^*_j$ to their closest integers $[X^*_j]$. Then
randomly add/subtract 1's to/from these values according to the value of $X_1 + \cdots +
X_{m+1} - n$ until $X_1 + \cdots + X_{m+1} = n$.

From the construction, it is clear that $o$ is a feasible solution of the integer
program. The rounding procedure will introduce the following errors to the optimal
solution:

1. By rounding $\{u^*_{j,k} : j \le m + 1, k \le t\}$, the values in $\{\chi(I_i) \cdot \chi(J_j) : i \le m, j \le
m + 1\}$ change. Thus the values in $\{y_{i,j} : i \le m, j \le m + 1\}$ will change,
and the values in $\{x_{i,j} : i \le m, j \le m + 1\}$ will be different from the
values in $\{x^*_{i,j} : i \le m, j \le m + 1\}$.

2. By rounding $\{X^*_j : j \le m + 1\}$, the values of $\{x_{i,j} : i \le m, j \le m + 1\}$
will change also.
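A minimal sketch of the two rounding steps of Method 1 follows. It assumes the fractional LP solution is already available as plain Python lists; the function names, the seeded random generator, and the rule of never pushing a count below zero are illustrative choices, not details given in the text.

```python
import math
import random

def round_u(u_star):
    """Round each fractional u*_{j,k} to 1 if it is at least 0.5, else to 0."""
    return [[1 if value >= 0.5 else 0 for value in row] for row in u_star]

def round_counts(x_star, n, rng=None):
    """Round each X*_j to its closest integer, then randomly add or subtract
    1's until the counts sum to n (assumes a non-empty list and n >= 0)."""
    rng = rng or random.Random(0)
    counts = [max(0, math.floor(value + 0.5)) for value in x_star]
    while sum(counts) != n:
        if sum(counts) < n:
            counts[rng.randrange(len(counts))] += 1
        else:
            # Only decrement positive entries so counts stay nonnegative.
            positive = [j for j, c in enumerate(counts) if c > 0]
            counts[rng.choice(positive)] -= 1
    return counts

rounded_u = round_u([[0.5, 0.49], [0.9, 0.1]])
rounded_counts = round_counts([2.4, 3.5, 1.2], n=8)
```

After this step, $y_{i,j}$, $x_{i,j}$, and $z_i$ would be recomputed from the rounded itemsets and counts according to their definitions.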
Method 2: rounding $x^*_{i,j}$

Construct an integer solution $o = (u_{j,k}, y_{i,j}, x_{i,j}, z_i, X_j)$ by rounding $x^*_{i,j}$ to 0 or
$X_j$ and computing the other values according to their definitions or relationships.
That is, first round $X^*_j$ to their closest integers $[X^*_j]$. Then randomly add/subtract
1's to/from these values according to the value of $X_1 + \cdots + X_{m+1} - n$ until
$X_1 + \cdots + X_{m+1} = n$. Now round $x^*_{i,j}$ as follows. Let

$$x_{i,j} = \begin{cases} X_j & \text{if } x^*_{i,j} \ge 0.5 X_j, \\ 0 & \text{otherwise.} \end{cases}$$

The $J_j$'s can be computed by setting

$$J_j = \bigcup_{i \,:\, x_{i,j} = X_j} I_i.$$

The values of $u_{j,k}$ and $y_{i,j}$ can be derived from $J_j$ easily. We still need to further
update the values of $x_{i,j}$ by using the current values of $y_{i,j}$ since we need to satisfy
the requirements $x_{i,j} = X_j \times y_{i,j}$.

From the construction, it is clear that $o$ is a feasible solution of the integer
program. The rounding procedure will introduce the following errors to the optimal
solution:

1. By rounding $\{x^*_{i,j} : i \le m, j \le m + 1\}$, we need to update the values of
$y_{i,j}$, which again leads to an update of the values of $x_{i,j}$.

2. By rounding $\{X^*_j : j \le m + 1\}$, the values in $\{x_{i,j} : i \le m, j \le m + 1\}$
will change also.

Method 3: randomized and derandomized rounding 

For quite a few NP-hard problems that are reduced to integer programs, naive
rounding methods remain the ones with the best known performance guarantee.
Our methods 1 and 2 are based on these naive rounding ideas. In the last decades,
randomization and derandomization methods (see, e.g., [I3T1I231 ) have received a 
great deal of attention in algorithm design. In this paradigm for algorithm design, 
a randomized algorithm is first designed, then the algorithm is "derandomized"
by simulating the role of the randomization in critical places in the algorithm. In
this section, we will design a randomized and derandomized rounding approach to
obtain an integer solution $o$ from $o^*$ whose performance is at least as good as the
expectation. This is done by the method of conditional probabilities.

In rounding method 1, we round $u^*_{j,k}$ to its closest integer. In a random
rounding [26], we set the value of $u_{j,k}$ to 1 with probability $u^*_{j,k}$ and to 0 with probability
$1 - u^*_{j,k}$ (independent of other indices).

In rounding method 2, we round $x^*_{i,j}$ to the closest value among 0 and $X_j$. In
a random rounding [26], we set the value of $x_{i,j}$ to $X_j$ with probability $x^*_{i,j}/X_j$ and to
0 with probability $1 - x^*_{i,j}/X_j$ (independent of other indices).

A random rounding approach produces integer solutions with an expected
value $z_0$ for $\sum_{i=1}^{m} z_i$. An improved rounding approach (derandomized rounding)
produces integer solutions with $\sum_{i=1}^{m} z_i$ guaranteed to be no larger than the
expected value $z_0$. In the following, we illustrate our method for the random
rounding based on the rounding methods 1 and 2.

Randomized and derandomized rounding of $x^*_{i,j}$. We determine the value of
an additional variable in each step. Suppose that $\{x_{i,j} : (i, j) \in I_0\}$ has already
been determined, and we want to determine the value of $x_{i_0,j_0}$ with $(i_0, j_0) \notin I_0$.
We compute the conditional expectation for $\sum_{i=1}^{m} z_i$ of this partial assignment first
with $x_{i_0,j_0}$ set to zero, and then again with it set to $X_{j_0}$. If we set $x_{i_0,j_0}$ according
to which of these values is smaller, then the conditional expectation at the end of
this step is at most the conditional expectation at the end of the previous step. This
implies that at the end of the rounding, we get at most the original expectation.

In the following, we show how to compute the conditional expectation. At
the beginning of each step, assume that for all entries $(i', j')$ in $I_0$, $x_{i',j'}$ has been
determined already and we want to determine the value of $x_{i_0,j_0}$ for $(i_0, j_0) \notin I_0$
in this step.
In order to compute the conditional expectation of $\sum_{i=1}^{m} z_i$, we first compute
the probability $\mathrm{Prob}[I_i \subseteq J_j]$ for all $(i, j) \notin I_0$. For each $j \le m + 1$, let

$$J_j^0 = \bigcup_{(i,j) \in I_0,\ x_{i,j} = X_j} I_i.$$

If $I_i \subseteq J_j^0$, then we have $I_i \subseteq J_j$ and $\mathrm{Prob}[I_i \subseteq J_j] = 1$. Otherwise, continue with
the following computation. By regarding $x^*_{i,j}/X_j$ as the probability that $x_{i,j}$ takes the
value $X_j$, we know that with probability at least $x^*_{i,j}/X_j$ we have $I_i \subseteq J_j$. However,
the actual probability may be larger since other entries $I_{i'}$ with $I_i \cap I_{i'} \ne \emptyset$ may
contribute items to $J_j$, which may lead to the inclusion of $I_i$ in $J_j$. First we define
the following sets.

$$L_{i,j} = \{1, \ldots, i-1, i+1, \ldots, m\} \setminus \{i' : (i', j) \in I_0\}$$

$$U_{i,j} = \Big\{K \subseteq L_{i,j} : I_i \subseteq J_j^0 \cup \bigcup_{i' \in K} I_{i'}\Big\}$$

and

$$U'_{i,j} = \{K \in U_{i,j} : \text{there is no } K' \in U_{i,j} \text{ such that } K' \subset K\}.$$

For each $K \in U'_{i,j}$, let

$$p(i, j, K) = \prod_{i' \in K} \frac{x^*_{i',j}}{X_j}.$$

Then the probability $\mathrm{Prob}[I_i \subseteq J_j]$ can be approximated as

$$\mathrm{Prob}[I_i \subseteq J_j] = \frac{x^*_{i,j}}{X_j} + \left(1 - \frac{x^*_{i,j}}{X_j}\right) \sum_{K \in U'_{i,j}} p(i, j, K).$$

Note that we say that we approximate the probability $\mathrm{Prob}[I_i \subseteq J_j]$ since in the
computation, we assume that $\mathrm{Prob}[I_{i'} \subseteq J_j] = x^*_{i',j}/X_j$ for other $i'$, which may not be
true. If necessary, we can improve the approximation by iteration. That is, repeat
the above procedure for several rounds and, in each round, use the approximated
probabilities for $\mathrm{Prob}[I_{i'} \subseteq J_j]$ from the previous round. If sufficiently many rounds are
repeated, the probability will converge in the end.

Since we have the probabilities $\mathrm{Prob}[I_i \subseteq J_j]$ for all $(i, j) \notin I_0$ now, it is
straightforward to compute the conditional expectation $E(\sum_{i=1}^{m} z_i) = \sum_{i=1}^{m} E(z_i)$.
The expected value for $z_i$ is

$$E(z_i) = E\left(\sum_{j=1}^{m+1} x_{i,j}\right) - s_i = \sum_{j=1}^{m+1} X_j \cdot \mathrm{Prob}[I_i \subseteq J_j] - s_i.$$

Randomized and derandomized rounding of u* k . We determine the value of 
an additional variable in each step. Suppose that {%& : (J, k) G Iq] has already 
been determined, and we want to determine the value of Uj 0> f. with (j , k ) £ J . 
We compute the conditional expectation for Y%Li %i of this partial assignment first 
with Mj OJO set to zero, and then again with it set to 1. If we set % ,fc according to 
which of these values is smaller, then the conditional expectation at the end of this 
step is at most the conditional expectation at the end of the previous step. This 
implies that at the end of the rounding, we get at most the original expectation. 

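According to our analysis in the randomized and derandomized rounding of
$x^*_j$, it is sufficient to compute the probability $\mathrm{Prob}[I_i \subseteq J_j]$ for all $i$ and $j$. Assume
$\mathcal{I} = \{e_1, \ldots, e_t\}$ and $I_i = \{e_{i_1}, \ldots, e_{i_{|I_i|}}\}$. Set

$$\mathrm{Prob}[I_i \subseteq J_j] = \hat{u}_{j,i_1} \times \cdots \times \hat{u}_{j,i_{|I_i|}}$$

where

$$\hat{u}_{j,i_s} = \begin{cases} u_{j,i_s} & \text{if } (j,i_s) \in I_0, \\ u^*_{j,i_s} & \text{otherwise} \end{cases}$$

for $s \le |I_i|$. Using $\mathrm{Prob}[I_i \subseteq J_j]$, one can compute the conditional expectation of
$\sum_{i=1}^{m} z_i$ as in the case for rounding of $x^*_j$.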
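
As a minimal sketch of the product above, the following Python function computes the inclusion probability for one transaction. The function name and the data layout (a list of fractional values per item and a dictionary of already rounded entries) are hypothetical and chosen only for illustration.

```python
def inclusion_probability(itemset, frac_row, fixed_row):
    """Probability that every item index in `itemset` ends up in transaction J_j.

    frac_row[k]  : fractional LP value u*_{j,k} for item k (assumed layout)
    fixed_row[k] : rounded 0/1 value of u_{j,k}, present only if already determined
    """
    prob = 1.0
    for k in itemset:
        # Use the determined value when available, the fractional value otherwise.
        prob *= fixed_row[k] if k in fixed_row else frac_row[k]
    return prob
```

If any item of the itemset has already been rounded to 0, the product is 0, which matches the fact that the transaction can no longer contain the itemset.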

Complexity analysis of the approximation algorithm 

In the integer linear program formulation of our problem, we have $t(m+1)$ variables
$u_{j,k}$, $m+1$ variables $x_j$, $m(m+1)$ variables $x_{i,j}$, $m(m+1)$ variables $y_{i,j}$,
and $m$ variables $z_i$. In total, we have $t(m+1) + 2m^2 + 4m + 1$ variables.

There are $(m+1)(2m+t)$ constraints in the condition ©, $4m(m+1)$ constraints
in the condition ©, and $3m+2$ constraints in the condition ©. Thus we
have $6m^2 + 9m + mt + t + 2$ constraints in total.
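
The totals above can be checked mechanically. The short Python sketch below sums the per-group counts quoted in the text; the function name `ilp_size` is hypothetical.

```python
def ilp_size(m, t):
    """Return (number of variables, number of constraints) of the integer
    linear program for m support constraints over t items."""
    variables = (
        t * (m + 1)      # u_{j,k}
        + (m + 1)        # x_j
        + m * (m + 1)    # x_{i,j}
        + m * (m + 1)    # y_{i,j}
        + m              # z_i
    )
    # The three groups of constraints quoted in the text.
    constraints = (m + 1) * (2 * m + t) + 4 * m * (m + 1) + (3 * m + 2)
    return variables, constraints
```

For example, `ilp_size(100, 50)` gives 25451 variables and 65952 constraints, so the program size grows quadratically in $m$ and linearly in $t$.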
The rounding, randomized rounding, and derandomized rounding algorithms can be
completed in $O(tm^3)$ steps. Thus the major challenge is to solve the relaxed
linear program in continuous variables. According to [19], linear programs with hundreds of thousands of
continuous variables are regularly solved. Thus our approximation algorithm is
efficient when $m$ and $t$ take reasonable values.

4 Privacy issues 

Wang, Wu, and Zheng [34] considered general information disclosure in the process
of mock database generation. In this section, we discuss privacy disclosures
in synthetic transaction databases. Confidential information in transaction
databases may be specified as a collection of itemsets and their corresponding
support (frequency) intervals. Let $\mathcal{P}$ be a set defined as follows.

$$\mathcal{P} = \{(I_i, s_i, S_i) : I_i \subseteq \mathcal{I},\ i \le l\}.$$

We say that a (synthetic) transaction database $\mathcal{D}$ does not disclose confidential
information specified in $\mathcal{P}$ if one cannot infer that

$$s_i \le \mathrm{support}(I_i, \mathcal{D}) \le S_i$$

for all $(I_i, s_i, S_i) \in \mathcal{P}$. Similarly, we say that a support constraint set $\mathcal{S} =
\{(I'_1, s'_1), \ldots, (I'_m, s'_m)\}$ does not disclose confidential information specified in $\mathcal{P}$
if for each element $(I_i, s_i, S_i) \in \mathcal{P}$, there is a transaction database $\mathcal{D}_i$ that satisfies
all support constraints in $\mathcal{S}$ and

$$\mathrm{support}(I_i, \mathcal{D}_i) \notin [s_i, S_i].$$

For the synthetic transaction database generation, there are two scenarios for 
potential private information disclosure. In the first scenario, the database owner 
uses the following procedure to generate the synthetic transaction database: 

1. use a software package to mine the real-world transaction database to get a
set of itemset support (frequency) constraints;

2. use a software package based on our linear program methods to generate a
synthetic transaction database $\mathcal{D}$ from the support (frequency) constraints;

3. release the synthetic transaction database $\mathcal{D}$ to the public.

In this scenario, the mined support (frequency) constraints are not released to
the public; only the synthetic transaction database is released. In this case,
it is straightforward to protect the confidential information specified in $\mathcal{P}$. The
database owner proceeds according to the above steps up to step 3. Before releasing
the synthetic transaction database $\mathcal{D}$, he can delete the confidential information
as follows.

• For each $(I_i, s_i, S_i) \in \mathcal{P}$, choose a random number $r_i \le n$, where $n$ is the
total number of transactions. We distinguish the following two cases:

1. If $u_i = \mathrm{support}(I_i, \mathcal{D}) - r_i < 0$, then choose a random series of $-u_i$
transactions that do not contain the itemset $I_i$, and modify these
transactions to contain the itemset $I_i$.

2. If $u_i = \mathrm{support}(I_i, \mathcal{D}) - r_i > 0$, then choose a random series of $u_i$
transactions that contain the itemset $I_i$, and modify these transactions
in a random way so that they do not contain the itemset $I_i$.

After the above process, the resulting transaction database contains no confidential
information specified in $\mathcal{P}$ and the database owner is ready to release it.
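
The two cases above can be sketched in Python for a single confidential itemset as follows. This is an illustration under stated assumptions, not the authors' code: transactions are modeled as Python sets, the random target is drawn uniformly from 0 to $n$ inclusive, and "modify in a random way" is interpreted as deleting one randomly chosen item of the itemset. The names `support` and `mask_support` are hypothetical.

```python
import random


def support(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for transaction in db if itemset <= transaction)


def mask_support(db, itemset, rng):
    """Replace the support of `itemset` in `db` (a list of sets, modified
    in place) by a random value r, and return r."""
    n = len(db)
    r = rng.randrange(n + 1)            # assumption: r uniform in 0..n
    u = support(itemset, db) - r
    if u < 0:
        # Case 1: too few supporting transactions, so add the itemset to -u others.
        without = [idx for idx, tr in enumerate(db) if not itemset <= tr]
        for idx in rng.sample(without, -u):
            db[idx] |= itemset
    elif u > 0:
        # Case 2: too many supporting transactions, so break the itemset in u of them.
        containing = [idx for idx, tr in enumerate(db) if itemset <= tr]
        for idx in rng.sample(containing, u):
            db[idx].discard(rng.choice(sorted(itemset)))
    return r
```

After the call, the support of the itemset equals the returned random value. When several confidential itemsets overlap, processing them one after another may change the supports set in earlier passes, so this sketch handles one itemset at a time.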

In the second scenario, the database owner uses the following procedure to 
generate the synthetic transaction database: 

1. use a software package to mine the real-world transaction database to get a
set of itemset support (frequency) constraints;

2. release the support (frequency) constraints to the public;

3. a customer who has interest in a synthetic transaction database generates
a synthetic transaction database $\mathcal{D}$ from the published support (frequency)
constraints using a software package based on our linear program methods.
In this scenario, the mined support (frequency) constraints are released to the
public directly. Thus the database owner wants to make sure that no confidential
information specified in $\mathcal{P}$ is contained in these support (frequency) constraints.
Without loss of generality, we assume that there is a single element $(I, s, S)$ in
$\mathcal{P}$ and the mined support constraints are $\mathcal{S} = \{(I_i, s_i) : i \le m\}$. $\mathcal{S}$ contains the
confidential information $(I, s, S)$ if and only if for each transaction database $\mathcal{D}$
which is consistent with $\mathcal{S}$, we have $\mathrm{support}(I, \mathcal{D}) \in [s, S]$. In other words, $\mathcal{S}$
does not contain the confidential information $(I, s, S)$ if and only if there exists
an integer $s'$ with $s' < s$ or $S < s' \le n$ such that $\mathcal{S} \cup \{(I, s')\}$ is consistent.
That is, there is a transaction database $\mathcal{D}$ that satisfies all support constraints in
$\mathcal{S} \cup \{(I, s')\}$. In the following, we show that there is not even an efficient way to
approximately decide whether a given support constraint set contains confidential
information. We first define the problem formally.

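ApproPrivacy

Instance: An integer $n$, an item set $\mathcal{I}$, a support constraint set $\mathcal{S} = \{(I'_1, s'_1), \ldots,
(I'_m, s'_m)\}$, and a set $\mathcal{P} = \{(I_i, s_i, S_i) : I_i \subseteq \mathcal{I},\ i \le l\}$.

Question: For all transaction databases $\mathcal{D}$ of $n$ transactions over $\mathcal{I}$ with
$|\mathrm{support}(I'_i, \mathcal{D}) - s'_i| = O(m)$ for all $i \le m$, do we have $\mathrm{support}(I_i, \mathcal{D}) \in [s_i, S_i]$ for all $i \le l$?
If the answer is yes, we write $\mathcal{S} \models_a \mathcal{P}$.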
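
To make the question concrete, the following Python sketch decides a tiny instance by exhaustive search. It is only an illustration: the asymptotic tolerance $O(m)$ is replaced by an explicit integer parameter `slack`, the function names are hypothetical, and the search enumerates every multiset of $n$ transactions, which is exponential and feasible only for very small inputs.

```python
from itertools import combinations, combinations_with_replacement


def support(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for transaction in db if itemset <= transaction)


def all_transactions(items):
    """Every subset of the item set, as frozensets."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def find_counterexample(n, items, constraints, secrets, slack):
    """Return a database of n transactions that satisfies every support
    constraint within `slack` but puts some secret support outside its
    interval, or None if no such database exists.

    constraints: list of (itemset, support) pairs
    secrets:     list of (itemset, low, high) triples
    """
    for db in combinations_with_replacement(all_transactions(items), n):
        if all(abs(support(I, db) - s) <= slack for I, s in constraints):
            for I, low, high in secrets:
                if not low <= support(I, db) <= high:
                    return db
    return None
```

A returned database is a short certificate that the constraint set does not force the confidential interval, which is why the complement of the problem lies in NP. A result of `None` corresponds to the answer yes in the problem definition.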

By Theorem 2.3, we have the following result. Similar NP-hardness results
for exact frequency constraints inference have been obtained in [6, 7, 20].

Theorem 4.1 ApproPrivacy is coNP-complete.

Proof. $\mathcal{S} \not\models_a \mathcal{P}$ if and only if there is a transaction database $\mathcal{D}$ and an
index $j \le l$ such that $\mathcal{D}$ satisfies $\mathcal{S} \cup \{\mathrm{support}(I_j, \mathcal{D}) < s_j\}$ or $\mathcal{D}$ satisfies
$\mathcal{S} \cup \{\mathrm{support}(I_j, \mathcal{D}) > S_j\}$ approximately. Thus the theorem follows from
Theorem 2.3. Q.E.D.

Thus, unless P = NP, there is no efficient way for the database owner to decide whether a
support constraint set $\mathcal{S}$ leaks confidential information specified in $\mathcal{P}$. In practice,
however, we can use the linear program based approximation algorithms that we
have discussed in Section 3 to compute the confidence level about private
information leakage as follows.

1. Convert the condition $\mathcal{S} \cup \{(I, s') : s' < s \text{ or } S < s' \le n\}$ to an integer
linear program in the format of ®. Note that the condition "$s' < s$ or $S <
s' \le n$" is equivalent to the existential clause $\exists s' ((s' < s) \lor (S < s' \le n))$.
Thus it is straightforward to convert it to integer linear program conditions.

2. Let the confidence level be $c = \sum_{i=1}^{m} z_i$. The smaller $c$ is, the higher the
confidence. In the ideal case of $c = 0$, we have found a transaction
database $\mathcal{D}$ that witnesses that no confidential information specified by
$(I, s, S)$ is leaked in $\mathcal{S}$.

If the database owner thinks that the value $c = \sum_{i=1}^{m} z_i$ obtained in the
above procedure is too large (and thus the confidence level is too low), he may use the
following procedure to delete potential confidential information from the support
constraint set.

1. Let $i$ be the index that maximizes $|I \cap I_i|$ over all $(I_i, s_i) \in \mathcal{S}$.

2. Modify the value $s_i$ to be a random value.

3. Approximately revise the support constraint values in $\mathcal{S}$ to make it consistent,
for example, to make it satisfy the monotonic rule. Since it is NP-hard to
determine whether a support constraint set is consistent, we can only revise
the set $\mathcal{S}$ to be approximately consistent.

It should be noted that after the above process, the resulting support constraint
set may become inconsistent. Thus in the next round, the value $c = \sum_{i=1}^{m} z_i$
may be larger. If that happens, the larger value $c$ should not be interpreted as the privacy
confidence level. Instead, it should be interpreted as an indicator of inconsistency
of the support constraint set. Thus the above privacy deletion procedure should
be carried out only once.
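
The three-step deletion procedure can be sketched in Python as follows. The repair in step 3 is one possible reading of "satisfy the monotonic rule" (a subset of an itemset must have support at least as large as the itemset) and is an assumption of this sketch, as are the function name and the uniform choice of the random value between 0 and $n$.

```python
import random


def perturb_and_repair(constraints, secret_itemset, n, rng):
    """constraints: list of (frozenset, support) pairs; returns a revised copy.

    Step 1: pick the constraint whose itemset overlaps the secret itemset most.
    Step 2: replace its support by a random value in 0..n (assumed range).
    Step 3: clamp the supports of its subsets and supersets so that the
            monotonic rule still holds around the perturbed constraint.
    """
    i = max(range(len(constraints)),
            key=lambda a: len(constraints[a][0] & secret_itemset))
    target = constraints[i][0]
    new_support = rng.randrange(n + 1)
    revised = []
    for idx, (itemset, s) in enumerate(constraints):
        if idx == i:
            s = new_support
        elif itemset < target:        # proper subset: support cannot be smaller
            s = max(s, new_support)
        elif itemset > target:        # proper superset: support cannot be larger
            s = min(s, new_support)
        revised.append((itemset, s))
    return revised
```

If the input constraints satisfy the monotonic rule, so does the output of this sketch. That is only a necessary condition for consistency, in line with the remark above that the set can only be made approximately consistent.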

We should note that even if the confidence level is high (that is, $c = \sum_{i=1}^{m} z_i$
is small), there is still a possibility that the confidential information specified by
$(I, s, S)$ is leaked in theory. That is, for each transaction database $\mathcal{D}$ that satisfies
the constraints $\mathcal{S}$, we have $\mathrm{support}(I, \mathcal{D}) \in [s, S]$. However, no one may be able
to recover this information since it is NP-hard to infer this fact. Support constraint
inference has been extensively studied by Calders in [6, 7].

It would be interesting to consider conditional privacy-preserving synthetic
transaction database generation. That is, we say that no private information is
leaked unless some hard problems are solved efficiently. This is similar to the
methodologies that are used in public key cryptography. For example, we believe
that the RSA encryption scheme is secure unless one can factor large integers efficiently.

In our case, we may assume that it is hard on average to solve
integer linear programs. Based on this assumption, we can say that unless integer
linear programs can be solved efficiently on average, no privacy specified in $\mathcal{P}$
is leaked by $\mathcal{S}$ if the computed confidence level $c = \sum_{i=1}^{m} z_i$ is small.

5 Related Work 

Privacy preserving data mining has been a very active research topic in the last
few years. There are two general approaches in the privacy preserving
data mining framework: data perturbation and distributed secure multi-party
computation. As this paper focuses on data perturbation
for a single site, we will not discuss the multi-party computation based approach for
distributed cases (See E3l for a recent survey). 

Agrawal and Srikant, in @), first proposed the development of data mining
techniques that incorporate privacy concerns and illustrated a perturbation based
approach for decision tree learning. Agrawal and Aggarwal, in [[Q, provided
an expectation-maximization (EM) algorithm for reconstructing the distribution of
the original data from perturbed observations. They provide information theoretic
measures to quantify the amount of privacy provided by a randomization
approach. Recently, Huang et al. [15] investigated how correlations among
attributes affect the privacy of a data set disguised via the random perturbation
scheme and proposed methods (PCA based and MLE based) to reconstruct the original
data. The objective of all randomization based privacy-preserving data mining
Q3ffl[IH[l8l|29l is to prevent the disclosure of confidential individual values while
preserving general patterns and rules. The idea of these randomization based
approaches is that the distorted data, together with the distribution of the random
data used to distort the data, can be used to generate an approximation to the original
data values while the distorted data does not reveal private information, and
thus is safe to use for mining. Although privacy preserving data mining seriously
considers how much information can be inferred or computed from large data
made available through data mining algorithms and looks for ways to minimize
the leakage of information, the problem of how to quantify and evaluate
the tradeoffs between data mining accuracy and privacy is still open [10].

In the context of privacy preserving association rule mining, there has also
been a lot of active research. In |[5l [H, the authors considered the problem of
limiting disclosure of sensitive rules, aiming at selectively hiding some frequent
itemsets from large databases with as little impact on other, non-sensitive frequent
itemsets as possible. The idea was to modify a given database so that the support
of a given set of sensitive rules decreases below the minimum support value. Similarly,
the authors in [30] presented a method for selectively replacing individual
values with unknowns from a database to prevent the discovery of a set of rules,
while minimizing the side effects on non-sensitive rules. The authors of [21] studied the
impact of hiding strategies in the original data set by quantifying how much
information is preserved after sanitizing a data set. The authors of [11, 29]
studied the problem of mining association rules from transactions in which the
data has been randomized to preserve the privacy of individual transactions. One
problem is that this may introduce some false association rules. The authors of [17, 32]
investigated distributed privacy preserving association rule mining. Though this
approach can fully preserve privacy, it works only in a distributed environment and
needs sophisticated protocols (based on secure multi-party computation [37]), which
makes it infeasible for our scenario.

Wu et al. have proposed a general framework for privacy preserving database
application testing by generating synthetic data sets based on some a priori knowledge
about the production databases [34]. General a priori knowledge such as
statistics and rules can also be taken as constraints on the underlying data records.
The problem investigated in this paper can be thought of as a simplified problem
where the data set is binary and the constraints are frequencies of given frequent
itemsets. However, the techniques developed in [34] are infeasible here, as the
number of items is much larger than the number of attributes in general data
sets.

6 Conclusions 

In this paper, we discussed the general problems regarding privacy preserving
synthetic transaction database generation for benchmark testing purposes. In particular,
we showed that this problem is generally NP-hard. Approximation algorithms
for both synthetic transaction database generation and privacy leakage confidence
level approximation have been proposed. These approximation algorithms include
solving a continuous variable linear program. According to [19], linear programs
having hundreds of thousands of continuous variables are regularly solved. Thus
if the support constraint set size is on the order of hundreds of thousands, then
these approximation algorithms are efficient on regular Pentium-based computers.
If more constraints are necessary, then more powerful computers are needed
to generate synthetic transaction databases.